\( \newcommand{\water}{{\rm H_{2}O}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\E}{\mathbb{E}} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\grad}{\nabla} \newcommand{\T}{^\text{T}} \newcommand{\mathbbone}{\unicode{x1D7D9}} \renewcommand{\:}{\enspace} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\Tr}{Tr} \newcommand{\norm}[1]{\lVert #1\rVert} \newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) } \newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.} \)

Score function estimator: A Monte Carlo gradient estimator

Problem

Suppose you have to compute the gradient of an expectation:

\[ \nabla_\phi \; \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] \]

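where, problematically, the gradient \(\nabla_\phi\) is taken with respect to the parameters of the distribution \(q_\phi(z)\) itself. Because the sampling distribution depends on \(\phi\), we cannot simply move the gradient inside the expectation and estimate it by sampling.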

Idea for solution

The score function estimator is the naive way to transform the problematic gradient above into an expectation that is easy to estimate. It takes its name from the score function in statistics, the gradient of the log-likelihood: \(\; S_n (\theta) = \nabla_\theta \, \sum_{i=1}^n \log p(x_i \,|\, \theta) \;\).

It is the naive way because it simply brings the gradient into the integral, and then turns the integral back into an expectation by multiplying and dividing by \(\; q_\phi(z) \;\). The factor \(\; \frac{1}{q_\phi(z)} \, \nabla_\phi \, q_\phi(z) \;\) that this produces is exactly the gradient of the log, \(\; \nabla_\phi \log q_\phi(z) \;\), while the remaining \(\; q_\phi(z) \;\) serves as the density of the new expectation.

Solution

\begin{align}
\nabla_\phi \; \mathbb{E}_{q_\phi(z)} \big[ f(z) \big]
& = \nabla_\phi \int f(z) \, q_\phi(z) \, \text{d}z \\[4pt]
& = \int f(z) \; \nabla_\phi \; q_\phi(z) \,\text{d}z \quad \quad \quad \text{(regularity assumption)} \\[4pt]
& = \int f(z) \; \frac{1}{q_\phi(z)} \, \nabla_\phi\, q_\phi(z) \;\, q_\phi(z) \, \text{d}z \\[15pt]
& = \mathbb{E}_{q_\phi(z)}\, \big[ f(z) \; \nabla_\phi\, \log q_\phi(z) \big]
\end{align}
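
As a sanity check of the identity, here is a small worked case. The choices \(q_\phi(z) = \mathcal{N}(z;\, \phi, 1)\) and \(f(z) = z^2\) are assumptions made only for illustration; they are not part of the derivation above. Under them, \(\nabla_\phi \log q_\phi(z) = z - \phi\), and writing \(z = \phi + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, 1)\):

\begin{align}
\mathbb{E}_{q_\phi(z)} \big[ f(z) \; \nabla_\phi \log q_\phi(z) \big]
& = \mathbb{E}_{\epsilon} \big[ (\phi + \epsilon)^2 \, \epsilon \big] \\[4pt]
& = \phi^2 \, \mathbb{E}[\epsilon] + 2 \phi \, \mathbb{E}[\epsilon^2] + \mathbb{E}[\epsilon^3] \\[4pt]
& = 2 \phi
\end{align}

This agrees with differentiating the expectation directly, since \(\mathbb{E}_{q_\phi(z)}[z^2] = \phi^2 + 1\) and \(\nabla_\phi \, (\phi^2 + 1) = 2\phi\).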

Then, we can estimate the expectation by Monte Carlo approximation, sampling \(z_i \sim q_\phi(z), \; i = 1, \dots, n\), and averaging:

\[ \nabla_\phi \; \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] \approx \frac{1}{n} \sum_{i=1}^n f(z_i) \; \nabla_\phi \log q_\phi(z_i) \]
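
The Monte Carlo step can be sketched in a few lines of NumPy. This is a minimal illustration, not a general implementation: it assumes \(q_\phi(z) = \mathcal{N}(z;\, \phi, 1)\) and \(f(z) = z^2\), for which the exact gradient is \(2\phi\), and the function name `score_function_gradient` is hypothetical.

```python
import numpy as np

def score_function_gradient(f, phi, n, rng):
    """Score function estimate of d/dphi E_{z ~ N(phi, 1)}[f(z)].

    Assumes a unit-variance Gaussian q_phi, so that
    grad_phi log q_phi(z) = z - phi.
    """
    z = rng.normal(loc=phi, scale=1.0, size=n)  # z_i ~ q_phi(z)
    score = z - phi                             # grad_phi log q_phi(z_i)
    return np.mean(f(z) * score)                # average of f(z_i) * score_i

rng = np.random.default_rng(0)
phi = 1.5
estimate = score_function_gradient(lambda z: z**2, phi, n=200_000, rng=rng)
exact = 2.0 * phi  # because E[z^2] = phi^2 + 1 under N(phi, 1)
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

Note that only values of \(f\) are needed, never its derivative; the dependence on \(\phi\) enters solely through the score term.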

Performance

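The score function estimator usually produces gradient estimates with high variance, so many samples are needed before the estimate is usable, which makes it impractical for many applications. When \(f\) is differentiable and samples from \(q_\phi(z)\) can be written as a differentiable transformation of parameter-free noise, better estimates are usually obtained with the reparameterization trick.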
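
The variance gap can be observed empirically on a toy problem. The sketch below again assumes \(q_\phi(z) = \mathcal{N}(z;\, \phi, 1)\) and \(f(z) = z^2\) (illustrative choices only), and compares the per-sample terms of the two estimators. For the reparameterized version it writes \(z = \phi + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, 1)\), so the per-sample gradient is \(f'(z) = 2z\).

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 1.5, 100_000

eps = rng.standard_normal(n)  # parameter-free noise
z = phi + eps                 # z ~ N(phi, 1), written as a function of phi

# Per-sample terms of each estimator of d/dphi E[z^2] (exact value: 2 * phi)
score_samples = z**2 * (z - phi)  # f(z) * grad_phi log q_phi(z)
reparam_samples = 2.0 * z         # d/dphi f(phi + eps) = f'(z)

print("means:    ", score_samples.mean(), reparam_samples.mean())
print("variances:", score_samples.var(), reparam_samples.var())
```

Both sample means should land near \(2\phi = 3\), while the per-sample variance of the score function terms should come out much larger than that of the reparameterized terms (whose variance is exactly 4 in this setup). How large the gap is depends on \(f\) and \(q_\phi\), so this toy case should not be read as a general rule.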